Skip to content

Enhance security by fixing ReDoS and preventing AST injection#252

Closed
iapoorv01 wants to merge 1 commit into
pathwaycom:mainfrom
iapoorv01:fix-jmespath-injection
Closed

Enhance security by fixing ReDoS and preventing AST injection#252
iapoorv01 wants to merge 1 commit into
pathwaycom:mainfrom
iapoorv01:fix-jmespath-injection

Conversation

@iapoorv01

Copy link
Copy Markdown
Contributor

Context

This Pull Request addresses a critical CWE-94 (Improper Control of Generation of Code / Code Injection) vulnerability within the _get_jmespath_filter UDF in python/pathway/xpacks/llm/document_store.py.

Previously, the filepath_globpattern argument was directly interpolated into a JMESPath single-quoted literal without applying any validation, encoding, or character restrictions. Because JMESPath engines lack native backslash-escaping mechanisms inside single-quoted literals, an unauthenticated user could construct a payload (e.g., x', path) || true || globmatch('x) that drops out of the string literal context. This allowed attackers to append structural logical operators (like || and &&), breaking out of the intended Abstract Syntax Tree (AST) structure.

In multi-tenant vector store topologies where data isolation is enforced via silent metadata_filter boundaries, this vulnerability enabled a complete isolation bypass via short-circuit evaluation.

Approaches Considered:

  1. String/Quote Escaping (e.g., replacing ' with \'): Rejected. JMESPath parsers handle escaping inconsistently depending on the underlying driver and engine implementation. This approach carries a high regression risk for edge cases.
  2. Regex Allowlisting (Chosen Approach): By enforcing a strict regex boundary (^[a-zA-Z0-9_\-\*\?\.\/\\ ]+$), we completely eliminate the possibility of AST corruption while natively supporting standard file path queries, wildcards, spaces, and nested directories. This guarantees structural query integrity and prevents unauthorized logic from reaching the evaluator.

How has this been tested?

  • Unit Testing: Implemented test_get_jmespath_filter_structural_integrity within test_document_store.py (explicitly calling __wrapped__ to avoid pw.udf decorator test failures) to verify:
    1. Valid, deterministic glob patterns succeed safely.
    2. Syntactically malicious payloads containing logic modifiers (e.g., x', path) || true) are aggressively blocked, raising a strict ValueError.
  • Linting & CI Constraints: Validated that all code formatting strictly adheres to the repository's black and flake8 standards to ensure 100% CI compliance.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature or improvement (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Related issue(s):

  1. Fixes [Bug]: Lacking validation on filepath_globpattern enables structural modification/logical bypass of JMESPath strings #251

Checklist:

  • My code follows the code style of this project,
  • My change requires a change to the documentation,
  • I described the modification in the CHANGELOG.md file.

@iapoorv01 iapoorv01 force-pushed the fix-jmespath-injection branch 2 times, most recently from 171e9d5 to fae7aae Compare June 29, 2026 19:53
@zxqfd555

Copy link
Copy Markdown
Collaborator

Thanks for the contribution, but I don't think we can take this as-is — both the threat model and the chosen fix have problems.

1. No trust boundary is being crossed. metadata_filter and filepath_globpattern come from the same client query schema (RetrieveQuerySchema / FilterSchema) at the same trust level. metadata_filter is, by design, an arbitrary JMESPath expression supplied by the caller. So a client that can set filepath_globpattern can already set metadata_filter to anything it wants — there is nothing to "short-circuit past". The "multi-tenant isolation enforced via a silent metadata_filter" scenario isn't how the Document Store API actually works.

2. Not code injection / CWE-94. JMESPath here is a read-only filter language (jmespath.search over in-memory metadata in vector_store.py / _knn_lsh.py), not eval. There's no code execution; worst case is filter manipulation, which (see point 1) the caller already fully controls.

3. The allowlist breaks legitimate glob patterns — functional regression. globmatch is backed by fnmatch per path level (_knn_lsh.py), which supports character classes and other glob syntax. The regex ^[a-zA-Z0-9_\-\*\?\.\/\\ ]+$ rejects valid inputs:

'data[0-9]/*.pdf'  -> valid glob, rejected
'*.{pdf,txt}'      -> rejected
'file[!a].pdf'     -> rejected

These are normal user queries that would now raise ValueError mid-pipeline.

4. Inconsistent even on its own terms. If literal-breakout were the concern, the stronger vector is metadata_filter (full JMESPath, also string-interpolated at line 37-40), which this PR leaves untouched. It hardens the weaker of the two identical channels.

5. Title mismatch. The title mentions a "ReDoS" fix, but nothing in the diff addresses ReDoS.

Minor: raising ValueError inside a pw.udf errors the row/pipeline rather than rejecting the query gracefully, and the test reaches into .__wrapped__.

If we did want to harden the string interpolation for robustness, the right direction would be escaping the single quote in the value (as is already partially done for metadata_filter) or building the expression via parameterization — not an allowlist that strips valid glob syntax. Given the above, I'm going to close this. Thanks again for looking into it.

@zxqfd555 zxqfd555 closed this Jun 30, 2026
@iapoorv01

Copy link
Copy Markdown
Contributor Author

Thanks for the contribution, but I don't think we can take this as-is — both the threat model and the chosen fix have problems.

1. No trust boundary is being crossed. metadata_filter and filepath_globpattern come from the same client query schema (RetrieveQuerySchema / FilterSchema) at the same trust level. metadata_filter is, by design, an arbitrary JMESPath expression supplied by the caller. So a client that can set filepath_globpattern can already set metadata_filter to anything it wants — there is nothing to "short-circuit past". The "multi-tenant isolation enforced via a silent metadata_filter" scenario isn't how the Document Store API actually works.

2. Not code injection / CWE-94. JMESPath here is a read-only filter language (jmespath.search over in-memory metadata in vector_store.py / _knn_lsh.py), not eval. There's no code execution; worst case is filter manipulation, which (see point 1) the caller already fully controls.

3. The allowlist breaks legitimate glob patterns — functional regression. globmatch is backed by fnmatch per path level (_knn_lsh.py), which supports character classes and other glob syntax. The regex ^[a-zA-Z0-9_\-\*\?\.\/\\ ]+$ rejects valid inputs:

'data[0-9]/*.pdf'  -> valid glob, rejected
'*.{pdf,txt}'      -> rejected
'file[!a].pdf'     -> rejected

These are normal user queries that would now raise ValueError mid-pipeline.

4. Inconsistent even on its own terms. If literal-breakout were the concern, the stronger vector is metadata_filter (full JMESPath, also string-interpolated at line 37-40), which this PR leaves untouched. It hardens the weaker of the two identical channels.

5. Title mismatch. The title mentions a "ReDoS" fix, but nothing in the diff addresses ReDoS.

Minor: raising ValueError inside a pw.udf errors the row/pipeline rather than rejecting the query gracefully, and the test reaches into .__wrapped__.

If we did want to harden the string interpolation for robustness, the right direction would be escaping the single quote in the value (as is already partially done for metadata_filter) or building the expression via parameterization — not an allowlist that strips valid glob syntax. Given the above, I'm going to close this. Thanks again for looking into it.

Thanks for the incredibly detailed breakdown, Sergey!

That makes total sense—I had incorrectly assumed metadata_filter was a backend-enforced boundary for multi-tenancy, but since both inputs share the exact same client trust boundary, you are completely right that this isn't an isolation bypass. Great catch on the fnmatch character classes as well; my regex allowlist was definitely way too strict and would have broken valid queries.

I really appreciate you taking the time to explain the architectural context so thoroughly. I completely agree with closing this one, and I've just updated the ReDoS fix over on #250 with the performance test and changelog updates you requested!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Lacking validation on filepath_globpattern enables structural modification/logical bypass of JMESPath strings

2 participants